Motivating Trends in Computer Architecture



Data collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten 
Vertically Integrated Research Methodology


Our research involves reconsidering all aspects of the computing stack 
including applications, programming frameworks, compiler optimizations, 
runtime systems, instruction set design, microarchitecture design, VLSI 
implementation, and hardware design methodologies 
Loop Dependence Pattern Specialization

Key Challenge: Creating HW/SW abstractions that are flexible
and enable performance-portable execution

Explicit Loop Specialization (XLOOPS)

Key Idea 1: Expose fine-grained parallelism by elegantly encoding
inter-iteration loop dependence patterns in the ISA

Key Idea 2: Single-ISA heterogeneous architecture with a new execution
paradigm supporting traditional, specialized, and adaptive execution



► Traditional Execution 

► Specialized Execution 

► Adaptive Execution 
1. XLOOPS Instruction Set

loop:
  lw       r2, 0(rA)
  lw       r3, 0(rB)
  addiu.xi rA, 4
  addiu.xi rB, 4
  addiu    rl, rl, 1
  xloop.uc rl, rN, loop

2. XLOOPS Compiler

#pragma xloops ordered
for (i = K; i < N; i++)
  A[i] = A[i] * A[i-K];

#pragma xloops atomic
for (i = 0; i < N; i++) {
  B[ A[i] ]++;
  D[ C[i] ]++;
}

3. XLOOPS Microarchitecture

4. Evaluation
XLOOPS Instruction Set Extensions

XLOOPS Instruction Set: Unordered Concurrent

Element-wise Vector Multiplication

for ( i=0; i<N; i++ )

► Instructions in loop cannot
write live-in registers

► Live-out values must be stored 
to memory 

► Data-races are possible 
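
A minimal Python sketch of the unordered-concurrent contract, assuming the element-wise multiplication named on this slide. The function name and the shuffled schedule are illustrative assumptions, not part of XLOOPS. Because no iteration reads a value written by another iteration, any iteration order produces the same result.

```python
import random

def vector_multiply_unordered(a, b, seed=0):
    """Element-wise multiply with iterations run in a shuffled order.

    The shuffle stands in for lanes picking up iterations in any order;
    it is an illustration, not the hardware's scheduling policy.
    """
    n = len(a)
    c = [0] * n
    order = list(range(n))
    random.Random(seed).shuffle(order)
    for i in order:
        # Each iteration touches only index i, so there is no
        # inter-iteration dependence and the order does not matter.
        c[i] = a[i] * b[i]
    return c

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
result = vector_multiply_unordered(a, b)
```

If two iterations wrote the same location, the final value would depend on the schedule, which is the data-race case the slide warns about.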
XLOOPS Instruction Set: Unordered Atomic

Histogram Updates

for ( i=0; i<N; i++ )

► Iterations execute atomically

► No race conditions 

► Results can be non-deterministic 

► Inspired by Transactional Memory 
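
A small Python sketch of the unordered-atomic contract, assuming the histogram update named on this slide. Both function names are hypothetical. The first loop shows that atomic increments lose no updates whatever the order; the second shows a loop whose result legitimately depends on the order, which is the non-determinism the slide mentions.

```python
import random

def histogram_atomic(keys, num_bins, seed=0):
    """Histogram update where each iteration is one indivisible step.

    Iterations run in a shuffled order; since each increment is applied
    as a whole, no update is lost, whatever the order.
    """
    bins = [0] * num_bins
    order = list(range(len(keys)))
    random.Random(seed).shuffle(order)
    for i in order:
        bins[keys[i]] += 1  # read-modify-write treated as atomic
    return bins

def last_writer(keys, seed=0):
    """An order-dependent atomic loop: the result is whichever
    iteration happens to run last."""
    order = list(range(len(keys)))
    random.Random(seed).shuffle(order)
    winner = None
    for i in order:
        winner = keys[i]
    return winner

keys = [0, 1, 1, 3, 1]
counts = histogram_atomic(keys, num_bins=4)
```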
XLOOPS Instruction Set: Ordered-Through-Registers

► rX - Cross-Iteration Register (CIR)

► CIRs are guaranteed to have
the same value as in a serial
execution

► Inspired by Multiscalar 
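
A rough Python sketch of the ordered-through-registers idea, assuming a running sum of squares as the loop (the slide's own example loop did not survive extraction). The accumulator plays the role of the cross-iteration register. The split into two phases is an illustrative assumption about which work could overlap across iterations.

```python
def prefix_sum_with_cir(values):
    """Running sum in which the accumulator plays the role of a CIR.

    Phase 1 models the part of each iteration that does not touch the
    CIR and could overlap across lanes. Phase 2 models the CIR being
    forwarded from iteration i to iteration i+1 in serial order.
    """
    # Phase 1: independent per-iteration work (here: squaring).
    squared = [v * v for v in values]

    # Phase 2: the cross-iteration register is passed along in order.
    out = []
    cir = 0
    for s in squared:
        cir = cir + s    # last update to the CIR in this iteration
        out.append(cir)  # value the next iteration will observe
    return out

sums = prefix_sum_with_cir([1, 2, 3])
```

The result matches a plain serial loop, which is the guarantee stated for CIRs above.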
XLOOPS Instruction Set: Ordered-Through-Memory

for ( i=k; i<N; i++ )
  A[i] = A[i] * A[i-k];

► Updates to memory defined by
serial iteration order

► No race conditions 

► Inspired by Multiscalar, TLS 
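
A Python sketch of why the loop above still contains parallelism, assuming k >= 1. Iteration i reads A[i-k], written k iterations earlier, so any k consecutive iterations are independent of each other. The blocked schedule below is only an illustration of that property; it uses a known k, whereas hardware memory disambiguation would have to discover the dependences at run time.

```python
import random

def serial(a, k):
    """Reference semantics: iterations in serial order."""
    a = list(a)
    for i in range(k, len(a)):
        a[i] = a[i] * a[i - k]
    return a

def blocked(a, k, seed=0):
    """Run iterations in blocks of k; order inside a block is shuffled.

    Every read a[i-k] in a block refers to an element outside the
    block that an earlier block has already finalized.
    """
    a = list(a)
    rng = random.Random(seed)
    for start in range(k, len(a), k):
        block = list(range(start, min(start + k, len(a))))
        rng.shuffle(block)
        for i in block:
            a[i] = a[i] * a[i - k]
    return a

data = [1, 2, 3, 4, 5, 6, 7]
expected = serial(data, 2)
```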
XLOOPS Instruction Set: Dynamic Bound
XLOOPS Compiler

Kernel implementing Floyd-Warshall shortest path algorithm

for ( int k = 0; k < n; k++ )
  #pragma xloops ordered
  for ( int i = 0; i < n; i++ )
    #pragma xloops unordered
    for ( int j = 0; j < n; j++ )
      path[i][j] = min( path[i][j], path[i][k] + path[k][j] );
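
For reference, a runnable Python version of the same kernel with serial semantics. The input graph and the `INF` sentinel are assumptions chosen for illustration. The comments mark which loops carry the slide's annotations.

```python
INF = float("inf")

def floyd_warshall(path):
    """All-pairs shortest paths, updating a copy of `path` in place."""
    n = len(path)
    path = [row[:] for row in path]
    for k in range(n):
        for i in range(n):      # annotated `ordered` on the slide
            for j in range(n):  # annotated `unordered` on the slide
                path[i][j] = min(path[i][j], path[i][k] + path[k][j])
    return path

graph = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
dist = floyd_warshall(graph)
```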
► Programmer annotations

  > unordered: no data-dependences
  > ordered: preserve data-dependences
  > atomic: atomic memory updates

► Loop strength reduction pass encodes
MIVs as xi instructions

► XLOOPS data-dependence analysis pass

  > Register-dependence: analyzing use-definition
    chains through PHI nodes
  > Memory-dependence: well-known
    dependence analysis techniques

► Detect updates to the loop bound to encode
dynamic-bound control-dependence pattern
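
A Python sketch of the strength-reduction step, assuming an address that advances by a fixed stride of 4 bytes per iteration as in `addiu.xi rA, 4`. The function names are hypothetical. The last function reflects an assumption about why a fixed increment is useful: any iteration's value can be computed from the iteration index alone, without running the earlier iterations.

```python
def addresses_multiplied(base, n, stride=4):
    """Address recomputed from the loop index: base + stride * i."""
    return [base + stride * i for i in range(n)]

def addresses_incremented(base, n, stride=4):
    """Strength-reduced form: one add per iteration, in the style of
    `addiu.xi rA, 4`."""
    out = []
    addr = base
    for _ in range(n):
        out.append(addr)
        addr += stride
    return out

def address_for_iteration(base, i, stride=4):
    """Closed form that lets iteration i start on its own."""
    return base + stride * i

expected_addrs = [0x1000, 0x1004, 0x1008, 0x100C]
```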


Cornell University 


Christopher Batten 




Traditional Execution

Minimal changes to a
general-purpose processor (GPP)

► xloop -> bne

► addiu.xi -> addiu

► addu.xi -> addu

Efficient traditional execution 

► Enables gradual adoption 

► Enables adaptive execution to 
migrate an xloop instruction 
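
A toy Python interpreter that sketches traditional execution, assuming the mapping listed above: `addiu.xi` behaves as `addiu` and `xloop.uc` behaves as `bne`. The tuple-based instruction format, the `mul`/`sw` loop body, and the `rC` register are assumptions added to make the example complete; the slides do not show the loop body.

```python
def run_traditional(program, regs, mem):
    """Execute a tiny MIPS-like program the way a plain GPP would."""
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "lw":                     # lw rd, offset(rs)
            rd, off, rs = args
            regs[rd] = mem[regs[rs] + off]
        elif op == "sw":                   # sw rt, offset(rs)
            rt, off, rs = args
            mem[regs[rs] + off] = regs[rt]
        elif op == "mul":                  # mul rd, rs, rt
            rd, rs, rt = args
            regs[rd] = regs[rs] * regs[rt]
        elif op in ("addiu", "addiu.xi"):  # rd = rs + imm
            rd, rs, imm = args
            regs[rd] = regs[rs] + imm
        elif op in ("bne", "xloop.uc"):    # branch if rs != rt
            rs, rt, target = args
            if regs[rs] != regs[rt]:
                pc = target
                continue
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return regs, mem

program = [
    ("lw", "r2", 0, "rA"),
    ("lw", "r3", 0, "rB"),
    ("mul", "r4", "r2", "r3"),    # assumed loop body
    ("sw", "r4", 0, "rC"),        # assumed loop body
    ("addiu.xi", "rA", "rA", 4),
    ("addiu.xi", "rB", "rB", 4),
    ("addiu.xi", "rC", "rC", 4),
    ("addiu", "rl", "rl", 1),
    ("xloop.uc", "rl", "rN", 0),  # branch target 0 is the `loop` label
]
mem = {0: 1, 4: 2, 8: 3, 100: 4, 104: 5, 108: 6}
regs = {"rA": 0, "rB": 100, "rC": 200, "rl": 0, "rN": 3}
regs, mem = run_traditional(program, regs, mem)
```

Because the xloop binary runs unchanged on such a core, a processor without a specialization unit can still execute it, which is what makes gradual adoption possible.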
Specialized Execution - xloop.uc

Loop Pattern Specialization Unit (LPSU)

► Lane Management Unit (LMU)

► Four decoupled in-order lanes

► Lanes contain instruction buffers
and index queues

► Lanes and the GPP arbitrate for
data-memory port and
long-latency functional unit

Specialized execution

► Scan phase

► Specialized execution phase
Specialized Execution - xloop.or

► Cross-iteration buffers (CIBs)
forward register-dependences

► LMU control logic

  > Cross-iteration registers (CIRs)
  > Last update to a CIR

► Lane control logic

  > Stall if CIR is not available
  > If last update to CIR then write to
    the next CIB
Specialized Execution - xloop.om

► LSQ to support hardware
memory disambiguation

► LMU control logic

  > Track non-speculative vs.
    speculative lanes
  > Promote lanes to be
    non-speculative

► Lane control logic

  > Handle structural hazards
  > Handle dependence violations
Supporting other patterns

► xloop.ua - Using xloop.om
mechanisms

► xloop.orm - Combine xloop.or
and xloop.om mechanisms

► xloop.*.db

  > Lanes communicate updates to
    loop bound
  > LMU tracks maximum bound and
    generates additional work
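
A serial Python sketch of a dynamic-bound loop, assuming a worklist-based breadth-first search (one of the xloop.uc.db kernels in the evaluation). The variable `bound` stands in for the loop bound register; the code structure is an illustrative assumption, not the benchmark's source.

```python
def bfs_dynamic_bound(adj, source):
    """Breadth-first search written as a loop whose bound grows.

    Iterations that discover new vertices append to the worklist and
    raise the bound, which creates additional iterations.
    """
    worklist = [source]
    dist = {source: 0}
    i = 0
    bound = 1
    while i < bound:
        v = worklist[i]
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                worklist.append(w)
                bound += 1  # update to the loop bound
        i += 1
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_dynamic_bound(adj, 0)
```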
Adaptive Execution

► Some kernels have higher performance on the LPSU
(e.g., significant inter-iteration parallelism)

► Some kernels have higher performance on the GPP
(e.g., limited inter-iteration parallelism, significant
intra-iteration parallelism)

► Approach #1: Move to more complicated superscalar or out-of-order
lanes to better exploit both inter- and intra-iteration parallelism

► Approach #2: Adaptively migrate between traditional and specialized
execution to achieve best performance

► Migrating a loop on iteration boundaries is very cheap and
usually only requires sending the next iteration index

► An adaptive profiling table in the GPP records profiling progress
for a small number of recently seen xloop instructions
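
One way an adaptive profiling table could be organized, sketched in Python. Everything here is a hypothetical design for illustration: the class name, the capacity, the cycles-per-iteration metric, and the rule of profiling the GPP first and the LPSU second are assumptions, not details taken from the slides.

```python
from collections import OrderedDict

class AdaptiveProfilingTable:
    """Small table keyed by xloop PC that remembers measured throughput."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # pc -> {"gpp": cpi, "lpsu": cpi}

    def record(self, pc, engine, cycles, iterations):
        """Store cycles per iteration for one engine on one xloop."""
        entry = self.entries.pop(pc, {})
        entry[engine] = cycles / iterations
        self.entries[pc] = entry              # most recently seen goes last
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently seen

    def choose(self, pc):
        """Pick where the next iterations of this xloop should run."""
        entry = self.entries.get(pc, {})
        if "gpp" not in entry:
            return "gpp"   # profile traditional execution first
        if "lpsu" not in entry:
            return "lpsu"  # then profile specialized execution
        return "gpp" if entry["gpp"] <= entry["lpsu"] else "lpsu"

table = AdaptiveProfilingTable(capacity=2)
first = table.choose(0x100)
table.record(0x100, "gpp", cycles=400, iterations=100)
second = table.choose(0x100)
table.record(0x100, "lpsu", cycles=150, iterations=100)
final = table.choose(0x100)
```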
Application Kernels

xloop.uc
  Color space conversion
  Dense matrix-multiply
  String search algorithm
  Symmetric matrix-multiply
  Viterbi decoding algorithm
  Floyd-Warshall shortest path

xloop.or
  ADPCM decoder
  Covariance computation
  Floyd-Steinberg dithering
  K-Means clustering
  SHA-1 encryption kernel
  Symmetric matrix-multiply

xloop.om
  Dynamic-programming
  K-Nearest neighbors
  Knapsack kernel
  Floyd-Warshall shortest path

xloop.orm, xloop.ua
  Greedy maximal-matching
  2D Stencil computation
  Binary tree construction
  Heap-sort computation
  Huffman entropy coding
  Radix sort algorithm

xloop.uc.db
  Breadth-first search
  Quick-sort algorithm

25 Kernels: MiBench, PolyBench, PBBS, custom
Cycle-Level Evaluation Methodology

► LLVM-3.1-based compiler framework

► gem5 - in-order and out-of-order processors

► PyMTL - LPSU models

► McPAT-1.0 - 45nm energy models
Energy-Efficiency vs. Performance Results

[Figure: three panels: In-order+LPSU vs. In-order Core, OOO 2-way+LPSU vs. OOO 2-Way, OOO 4-way+LPSU vs. OOO 4-Way]

► XLOOPS vs. Simple Core: Similar energy efficiency, higher power

► XLOOPS vs. OOO 2-way: Higher energy efficiency, mixed power

► XLOOPS vs. OOO 4-way: Higher energy efficiency, lower power

► Adaptive execution trades energy efficiency for performance

► Profiling and migration cause minimal performance degradation
VLSI Implementation

[Figure: chip layout with labeled blocks: Scalar Processor, Loop Pattern Specialization Unit, 32b Integer Mul/Div Unit, 32b IEEE Floating Point Unit, ICache (16KB SRAM for cache lines, tags), DCache (16KB SRAM for cache lines, tags)]

► TSMC 40nm standard-cell-based implementation

► RISC scalar processor with 4-lane LPSU

► Supports xloop.uc

► ~40% extra area compared to simple RISC processor
XLOOPS Take-Away Points

[Figure: XLOOPS assembly loop and annotated C loops alongside a block diagram with OoO GPP, Lane Manager, and L1 Data Cache]

► XLOOPS is an elegant new abstraction that
enables performance-portable execution of loops

► XLOOPS enables a single-ISA heterogeneous
architecture with a new execution paradigm

  > Traditional Execution
  > Specialized Execution
  > Adaptive Execution

► XLOOPS is able to achieve higher performance
compared to simple in-order cores and improved
energy efficiency compared to complex
out-of-order cores
PyMTL: A Unified Framework for
Vertically Integrated Computer
Architecture Research

Derek Lockhart, Gary Zibrat,
Christopher Batten

47th ACM/IEEE Int’l Symp. on
Microarchitecture (MICRO)
Cambridge, UK, Dec. 2014


Pydgin: Generating Fast
Instruction Set Simulators from
Simple Architecture Descriptions
with Meta-Tracing JIT Compilers

Derek Lockhart, Berkin Ilbeyi,
Christopher Batten

IEEE Int’l Symp. on Perf Analysis of
Systems and Software (ISPASS)
Philadelphia, PA, Mar. 2015
Computer Architecture Research Methodologies

Instruction Set Architecture

Functional-Level Modeling
- Behavior

Cycle-Level Modeling
- Behavior
- Cycle-Approximate
- Analytical Area, Energy, Timing

Register-Transfer-Level Modeling
- Behavior
- Cycle-Accurate Timing
- Gate-Level Area, Energy, Timing
Computer Architecture Research Methodologies


Computer Architecture 
Research Methodology Gap 

FL, CL, RTL modeling 
use very different 
languages, patterns, 
tools, and methodologies 



Functional-Level Modeling 

- Algorithm/ISA Development 

- MATLAB/Python, C++ ISA Sim 

Cycle-Level Modeling 

- Design-Space Exploration 

- C++ Simulation Framework 

- SW-Focused Object-Oriented 

- gem5, SESC, McPAT 

Register-Transfer-Level Modeling 

- Prototyping & AET Validation 

- Verilog, VHDL Languages 

- HW-Focused Concurrent Structural 

- EDA Toolflow 
Great Ideas From Prior Work


Concurrent-Structural Modeling 

(Liberty, Cascade, SystemC) 

Unified Modeling Languages 

(SystemC) 

Hardware Generation Languages 

(Chisel, Genesis2, BlueSpec, MyHDL) 

HDL-Integrated Simulation Frameworks 

(Cascade) 

Latency-Insensitive Interfaces 

(Liberty, BlueSpec) 


Consistent interfaces across abstractions 


Unified design environment for FL, CL, RTL 
Productive RTL design space exploration 

Productive RTL validation and cosimulation 


Component and test bench reuse 
What is PyMTL?

• A Python DSEL for concurrent-structural hardware modeling

• A Python API for analyzing models described in the PyMTL DSEL

• A Python tool for simulating PyMTL FL, CL, and RTL models

• A Python tool for translating PyMTL RTL models into Verilog

• A Python testing framework for model validation
What Does PyMTL Enable?

• Incremental refinement from algorithm to accelerator implementation

• Automated testing and integration of PyMTL-generated Verilog

• Multi-level co-simulation of FL, CL, and RTL models

• Construction of highly-parameterized RTL chip generators

• Embedding within C++ frameworks & integration of C++/Verilog models

(Used to implement CL model for XLOOPS LPSU)
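
A plain-Python sketch of the idea behind incremental refinement and test bench reuse. This does not use the PyMTL API; the class names, the `compute` interface, and the latency numbers are all assumptions made for illustration. The point is that a functional-level model and a cycle-level model expose the same interface, so one test bench checks both.

```python
class DotProductFL:
    """Functional-level model: behavior only, no notion of time."""

    def compute(self, a, b):
        return sum(x * y for x, y in zip(a, b))

class DotProductCL:
    """Cycle-level model: same behavior plus an approximate cycle count.

    One multiply-accumulate per cycle plus a fixed startup latency;
    both numbers are made-up parameters.
    """

    def __init__(self, startup_cycles=2):
        self.startup_cycles = startup_cycles
        self.cycles = 0

    def compute(self, a, b):
        acc = 0
        self.cycles = self.startup_cycles
        for x, y in zip(a, b):
            acc += x * y
            self.cycles += 1
        return acc

def check_model(model):
    """One test bench reused across abstraction levels."""
    assert model.compute([1, 2, 3], [4, 5, 6]) == 32
    assert model.compute([], []) == 0
    return True

fl_ok = check_model(DotProductFL())
cl = DotProductCL()
cl_ok = check_model(cl)
cl.compute([1, 2, 3], [4, 5, 6])
```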
The PyMTL Framework

[Figure: PyMTL framework flow with stages Specification, Tools, and Output; labeled elements include Test & Sim Harness, Simulation Tool, and Traces & VCD]

But isn’t Python too slow?
Performance/Productivity Gap


Python is growing in popularity in many domains of scientific and 
high-performance computing. How do they close this gap? 

► Python-Wrapped C/C++ Libraries 
(NumPy, CVXOPT, NLPy, pythonoCC, gem5) 

► Numerical Just-In-Time Compilers 
(Numba, Parakeet) 

► Just-In-Time Compiled Interpreters 
(PyPy, Pyston) 

► Selective Embedded Just-In-Time Specialization 
(SEJITS) 
PyMTL SimJIT-RTL Architecture

[Figure: SimJIT-RTL flow from a PyMTL RTL model instance to a CFFI model instance]



PyMTL Results: 64-Node Mesh Network

[Figure: two panels, "Simulation Time Including Compile Time" and "Simulation Time Excluding Compile Time"; axis label: Simulated Cycles; series: Verilator, SimJIT+PyPy, SimJIT, PyPy, CPython]

RTL model of a 64-node mesh network with single-cycle routers, elastic-buffer
flow control, and uniform random traffic at an injection rate just below saturation
PyMTL ASIC Tapeout

Layout generated from PyMTL for
simple processor, L1 memory
system, dot product accelerator

Target Tech: 2x2mm IBM 130nm


Xilinx ZC706 FPGA development board
for FPGA prototyping

Custom designed FMC mezzanine card
for ASIC test chips
Computer Architecture Research Methodologies

While it is certainly possible to create stand-alone instruction
set simulators in PyMTL, they are quite slow (~100 KIPS)

Can we achieve high performance while maintaining productivity
for instruction set simulators?
Productivity vs. Performance

[Figure: Architectural Description Language (productivity) mapped to Instruction Set Interpreter in C with DBT (performance)]

[Simit-ARM2006]
+ Page-based JIT
- Ad-hoc ADL with custom parser
- Unmaintained

[Wagstaff2013]
+ Region-based JIT
+ Industry-supported ADL (ArchC)
- C++-based ADL is verbose
- Not public

[Simit-ARM2006] J. D'Errico and W. Qin. Constructing Portable Compiled Instruction-Set Simulators: An ADL-Driven Approach. DATE'06.
[Wagstaff2013] H. Wagstaff, M. Gould, B. Franke, and N. Topham. Early Partial Evaluation in a JIT-Compiled, Retargetable Instruction Set Simulator Generated from a High-Level Architecture Description. DAC'13.
Productivity vs. Performance

Key Insight:

Similar productivity-performance challenges for
building high-performance interpreters of
dynamic languages
(e.g. JavaScript, Python)

[Figure: Architectural Description Language to Instruction Set Interpreter in C with DBT, alongside PyPy: Dynamic Language Interpreter in C with JIT Compiler]
Productivity vs. Performance

[Figure: two parallel flows. [Simit-ARM2006], [Wagstaff2013]: Architectural Description Language to Instruction Set Interpreter in C with DBT. PyPy: Dynamic-Language Interpreter in RPython, through the RPython Translation Toolchain, to Dynamic Language Interpreter in C with JIT Compiler]

Meta-Tracing JIT:
makes JIT generation generic across languages
Productivity vs. Performance

[Figure: Pydgin, placed between Architectural Description Language (productivity) and Instruction Set Interpreter in C with DBT (performance)]

• Flexible, productive, pseudocode-like ADL syntax

• ADL embedded in a popular, general-purpose language

• Tracing-JIT generator applies across many different ISAs

• Leverages advancements from dynamic-language JIT research
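
To give a feel for what a pseudocode-like ADL embedded in a general-purpose language can look like, here is a plain-Python sketch. It is not Pydgin's actual API: the table layout, the `field` helper, the state dictionary, and the function names are assumptions for illustration. The two encodings follow the standard MIPS32 formats for `addu` and `addiu`; register 0 is not hardwired to zero in this toy.

```python
# Encoding table: 'x' marks a don't-care bit (32 bits, MSB first).
ENCODINGS = [
    ("addu",  "000000" + "x" * 15 + "00000" + "100001"),
    ("addiu", "001001" + "x" * 26),
]

def field(inst, hi, lo):
    """Extract bits hi..lo (inclusive) of a 32-bit instruction."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)

def execute_addu(state, inst):
    rs, rt, rd = field(inst, 25, 21), field(inst, 20, 16), field(inst, 15, 11)
    state["rf"][rd] = (state["rf"][rs] + state["rf"][rt]) & 0xFFFFFFFF
    state["pc"] += 4

def execute_addiu(state, inst):
    rs, rt, imm = field(inst, 25, 21), field(inst, 20, 16), field(inst, 15, 0)
    if imm & 0x8000:  # sign-extend the 16-bit immediate
        imm -= 0x10000
    state["rf"][rt] = (state["rf"][rs] + imm) & 0xFFFFFFFF
    state["pc"] += 4

EXECUTE = {"addu": execute_addu, "addiu": execute_addiu}

def decode(inst):
    """Return the name of the first encoding that matches."""
    bits = format(inst, "032b")
    for name, pattern in ENCODINGS:
        if all(p in ("x", b) for p, b in zip(pattern, bits)):
            return name
    raise ValueError("illegal instruction")

def step(state, inst):
    EXECUTE[decode(inst)](state, inst)

state = {"pc": 0, "rf": [0] * 32}
step(state, 0x24010005)  # addiu r1, r0, 5
step(state, 0x00211021)  # addu  r2, r1, r1
```

A plain interpreter like this is what the JIT is meant to accelerate; the sketch itself contains no JIT.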
Pydgin Results: ARMv5 Instruction Set

[Figure: results comparing Pydgin w/o JIT, Pydgin w/ JIT, SimIt-ARM, and QEMU]

Porting Pydgin to a new user-level ISA takes just a few weeks
PyMTL/Pydgin Take-Away Points


► PyMTL is a productive Python framework for 
FL, CL, and RTL modeling and hardware design 

► Pydgin is a framework for rapidly developing very 
fast instruction-set simulators from a Python- 
based architecture description language 

► PyMTL and Pydgin leverage a novel application of
JIT compilation to help close the
performance/productivity gap

► Alpha versions of PyMTL and Pydgin are 
available for researchers to experiment with at 

https://github.com/cornell-brg/pymtl 

https://github.com/cornell-brg/pydgin 





Derek Lockhart, Ji Kim, Shreesha Srinath, Christopher Torng,
Berkin Ilbeyi, Moyang Wang, and many M.S./B.S. students


Prof. Zhiru Zhang, Mingxing Tan, Gai Liu 




Equipment and Tool Donations 

Intel, NVIDIA, Synopsys, Xilinx 
Batten Research Group

Exploring cross-layer hardware
specialization using a vertically
integrated research methodology

[Figure: Energy Efficiency (Tasks per Joule) vs. Performance (Tasks per Second), with regions labeled Embedded Architectures, High-Performance Architectures, Less Flexible Accelerator, and Custom ASIC, bounded by a Design Performance Constraint and a Design Power Constraint]